import os
import numpy as np
import pandas as pd
import bokeh
import cmdstanpy
import arviz as az
import warnings
warnings.filterwarnings('ignore')
import bebi103
wdir = os.getcwd()
bokeh.io.output_notebook()
First we need to load the data. Since the time column does not reset between growth events, we will manually compute a regularized time that restarts at the beginning of each event. Lastly, we will make a quick plot of the first growth event to get the kind of background information we would have if we were doing the experiment ourselves, such as the order of magnitude of a Caulobacter cell's size and lifetime.
df = pd.read_csv('../data/caulobacter_growth_events.csv')
df = df.rename(columns={'time (min)':'t', 'area (µm²)':'area', 'growth event':'event'})
new_t = np.zeros(len(df))
for (bacterium, event), df_event in df.groupby(['bacterium', 'event']):
    # Shift each growth event so its time axis starts at 1
    t0_event = df_event.t.values[0]
    new_t[df_event.index] = df_event.t - t0_event + 1
df['time'] = new_t
time = df['time']
area = df['area']
e = bokeh.plotting.figure(height=250, width=450,
x_axis_label='time (min)', y_axis_label='area (µm²)')
e.circle(df[(df['bacterium'] == 1) & (df['event'] == 0)].time,
df[(df['bacterium'] == 1) & (df['event'] == 0)].area)
bokeh.io.show(e)
We want to model Caulobacter growth using two models: linear and exponential. The linear model is given by $a(t)=a_0+bt$ and the exponential by $a(t)=a_0e^{kt}$. Both models share the same structure: if $a(t)$ is the model's growth function, we model the measured areas with the likelihood
$$a_i \sim \textrm{Norm}(a(t_i), \sigma).$$We use a hierarchical model in which $b^m_n$ and $k^m_n$ are the growth rates for bacterium $m$ and growth event $n$. With hyperparameters $b$ and $k$, the first level of the hierarchy models the cell-to-cell variability as $b^m \sim \textrm{Norm}(b,\sigma_b)$ and $k^m \sim \textrm{Norm}(k,\sigma_k)$, and the second level models the event-to-event variability within each cell as $b^m_n \sim \textrm{Norm}(b^m,\sigma_b)$ and $k^m_n \sim \textrm{Norm}(k^m,\sigma_k)$, where we use the same $\sigma_{b/k}$ at both levels of the hierarchy.
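To make the hierarchical structure concrete, here is a minimal generative sketch of the linear model in NumPy (all numerical values here are hypothetical, chosen only for illustration, not estimates):

```python
import numpy as np

rg = np.random.default_rng(seed=3252)

# Hypothetical values for illustration only -- not estimates
b, sigma_b = 0.012, 0.003   # hyperparameter growth rate and level-to-level spread
a0, sigma = 1.5, 0.1        # initial area (µm²) and measurement noise
n_cells, n_events = 2, 3
t = np.arange(1, 91)        # one 90-minute growth event, in minutes

# Level 1: each cell m draws its own rate from the hyperparameter
b_m = rg.normal(b, sigma_b, size=n_cells)

# Level 2: each growth event n of cell m draws its rate from that cell's rate
b_mn = rg.normal(b_m[:, None], sigma_b, size=(n_cells, n_events))

# Likelihood: noisy area measurements around the linear growth curve
area = rg.normal(a0 + b_mn[..., None] * t, sigma)
print(area.shape)
```

The exponential model is identical except that the mean is `a0 * np.exp(k_mn[..., None] * t)`.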
This means that we need priors for $a_0$, $b$, $k$, $\sigma$, $\sigma_{b}$, and $\sigma_{k}$.
In each model, $a_0$ represents the size of the bacterium at the start of each growth event. A quick search finds that Caulobacter cells are about 0.7 µm across and 2-3 µm long, so we estimate the area to be about 1.75 µm². Since we want the initial size, which is likely a bit lower, we will guess a value of about 1.5 µm² and take the prior $a_0 \sim \textrm{Norm}(1.5, 0.75)$. This prior will occasionally give negative initial sizes, but conditioning on the real data will deal with that, and we would rather be too broad than too narrow.
We know that the Caulobacter lifetime is on the order of an hour or two, and since the cell grows and then divides in two, we expect $b$ to be about $1.25\,/90 \approx 0.01389$ (an area gain of roughly 1.25 µm² over a 90 minute lifetime). If the lifetime is as short as 30 minutes and the cell quadruples in size, this gives a rough upper bound of $4\,/30 \approx 0.13333$, and if the cell does not grow at all, a lower bound of $0$. Thus, we will take the prior $b \sim \textrm{HalfNorm}(0.05)$.
We can use similar reasoning to come up with a prior for $k$. We expect $\ln(r)/t_f$, where $r$ is the ratio of final to initial area and $t_f$ is the time at cell division, to be an estimate for $k$. For $r=4$ and $t_f=90$, we get an upper bound of about $0.01540$, so we will take the prior $k \sim \textrm{HalfNorm}(0.0075)$.
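As a quick sanity check, the arithmetic behind these prior scales for $b$ and $k$ is reproduced below:

```python
import numpy as np

# Linear rate b: area gain of roughly 1.25 µm² over a ~90 min lifetime,
# with a rough upper bound from quadrupling in 30 min
b_typical = 1.25 / 90
b_upper = 4 / 30

# Exponential rate k: ln(final/initial area ratio) / division time
k_upper = np.log(4) / 90

print(f'b typical: {b_typical:.5f} µm²/min, b upper bound: {b_upper:.5f} µm²/min')
print(f'k upper bound: {k_upper:.5f} 1/min')
```

These reproduce, up to rounding, the values quoted above, and the HalfNorm scales of 0.05 and 0.0075 sit comfortably within these bounds.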
The parameter $\sigma$ will be the same for all $a_i$, since we expect the growth measurements to be similarly noisy at all times: the area varies over less than an order of magnitude within a growth event. Given our quick estimate of the average Caulobacter area and the rough expectation of up to 30% variability between measurements, we will take the prior $\sigma \sim \textrm{HalfNorm}(0.3)$.
For $\sigma_b$ and $\sigma_k$ we will use similar priors, namely $\sigma_b \sim \textrm{HalfNorm}(0.025)$ and $\sigma_k \sim \textrm{HalfNorm}(0.0035)$.
The full model is then
$$\begin{align} a_{i\,n}^{\,\,m} &\sim \textrm{Norm}(a^m_n(t_i), \sigma) \\[0.5em] a^m_n(t) = a_0+b^m_nt \;\; &\textrm{OR} \;\; a^m_n(t)=a_0e^{k^m_nt} \\[0.5em] b &\sim \textrm{HalfNorm}(0.05) \\[0.5em] k &\sim \textrm{HalfNorm}(0.0075) \\[0.5em] b^m &\sim \textrm{Norm}(b,\sigma_b) \\[0.5em] k^m &\sim \textrm{Norm}(k,\sigma_k) \\[0.5em] b^m_n &\sim \textrm{Norm}(b^m,\sigma_b) \\[0.5em] k^m_n &\sim \textrm{Norm}(k^m,\sigma_k) \\[0.5em] \sigma_b &\sim \textrm{HalfNorm}(0.025) \\[0.5em] \sigma_k &\sim \textrm{HalfNorm}(0.0035) \\[0.5em] \sigma &\sim \textrm{HalfNorm}(0.3) \\[0.5em] a_0 &\sim \textrm{Norm}(1.5, 0.75) \end{align}$$Now we can do our prior predictive checks with Stan. Given the size of the data set, we will take only 500 prior predictive samples; plotting many more points at once makes the browser struggle.
data, df_stan = bebi103.stan.df_to_datadict_hier(df,
                                                 level_cols=['bacterium', 'event'],
                                                 data_cols=['time', 'area'])
data.update(dict(b_sigma=0.05, k_sigma=0.0075, sigma_b_sigma=0.025, sigma_k_sigma=0.0035,
                 sigma_sigma=0.3, a0_mu=1.5, a0_sigma=0.75, N_ppc=250))
os.chdir(wdir)
with bebi103.stan.disable_logging():
    sm_prior = cmdstanpy.CmdStanModel(stan_file='./hw10.1_dc_prior.stan')
print('\nPrior Predictive Code:\n\n', sm_prior.code())
Prior Predictive Code:
functions {
  real lin_growth(real t, real a0, real b) {
    return a0 + b*t;
  }
  real exp_growth(real t, real a0, real k) {
    return a0*e()^(k*t);
  }
}
data {
  // metadata
  int N;
  int J_1;
  int J_2;
  int index_1[J_2];
  int index_2[N];
  // data
  vector[N] time;
  vector[N] area;
  // parameter specifications
  real<lower=0> b_sigma;
  real<lower=0> k_sigma;
  real<lower=0> sigma_b_sigma;
  real<lower=0> sigma_k_sigma;
  real<lower=0> sigma_sigma;
  real<lower=0> a0_mu;
  real<lower=0> a0_sigma;
}
generated quantities {
  vector[N] linear;
  vector[N] exponential;
  real a0_lin = normal_rng(a0_mu, a0_sigma);
  real a0_exp = normal_rng(a0_mu, a0_sigma);
  real b = fabs(normal_rng(0, b_sigma));
  real k = fabs(normal_rng(0, k_sigma));
  real sigma_lin = fabs(normal_rng(0, sigma_sigma));
  real sigma_exp = fabs(normal_rng(0, sigma_sigma));
  real sigma_b = fabs(normal_rng(0, sigma_b_sigma));
  real sigma_k = fabs(normal_rng(0, sigma_k_sigma));
  vector[J_1] b_m;
  vector[J_1] k_m;
  for (i in 1:J_1) {
    b_m[i] = normal_rng(b, sigma_b);
    k_m[i] = normal_rng(k, sigma_k);
  }
  vector[J_2] b_mn;
  vector[J_2] k_mn;
  for (i in 1:J_2) {
    b_mn[i] = normal_rng(b_m[index_1[i]], sigma_b);
    k_mn[i] = normal_rng(k_m[index_1[i]], sigma_k);
  }
  for (i in 1:N) {
    linear[i] = normal_rng(lin_growth(time[i], a0_lin, b_mn[index_2[i]]), sigma_lin);
    exponential[i] = normal_rng(exp_growth(time[i], a0_exp, k_mn[index_2[i]]), sigma_exp);
  }
}
with bebi103.stan.disable_logging():
    prior_samples = sm_prior.sample(iter_sampling=500, data=data,
                                    fixed_param=True, show_progress=False)
prior_samples = az.from_cmdstanpy(prior=prior_samples,
                                  prior_predictive=['linear', 'exponential'])
lin_prior_samples = prior_samples.prior_predictive.linear.values.reshape(500,10321)
exp_prior_samples = prior_samples.prior_predictive.exponential.values.reshape(500,10321)
To plot our data, for each sample we will select a random growth event from a random bacterium and overlay the selections on top of one another for each model.
lin_prior = bokeh.plotting.figure(height=300, width=500, title='Linear',
x_axis_label='time (min)', y_axis_label='area (µm²)')
exp_prior = bokeh.plotting.figure(height=300, width=500, y_range=(-2,20), title='Exponential',
x_axis_label='time (min)', y_axis_label='')
rg = np.random.default_rng()
for i in range(500):
    # Pick a random growth event from a random bacterium for this sample
    bacterium = rg.choice([1, 2])
    event = rg.choice(df[df['bacterium'] == bacterium].event.unique())
    choice_index = df[(df['bacterium'] == bacterium) & (df['event'] == event)].index
    lin_prior.circle(time[choice_index], lin_prior_samples[i][choice_index], alpha=0.1)
    exp_prior.circle(time[choice_index], exp_prior_samples[i][choice_index], alpha=0.1)
bokeh.io.show(bokeh.layouts.row([lin_prior, exp_prior], sizing_mode='scale_both'))
These look pretty good. There are some negative values and some values that are certainly too high, but most of the predictive data sets fall in a reasonable envelope of where we would expect them to be, and the parameter values will be refined by conditioning on the data. We would like to perform simulation-based calibration (SBC), but given our computing and time constraints we will not, so we move on to coding up the model and doing MCMC.
os.chdir(wdir)
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='./hw10.1.stan')
print('\nSampling and Posterior Predictive Code:\n\n', sm.code())
Sampling and Posterior Predictive Code:
functions {
  real lin_growth(real t, real a0, real b) {
    return a0 + b*t;
  }
  real exp_growth(real t, real a0, real k) {
    return a0*e()^(k*t);
  }
}
data {
  // metadata
  int N;
  int J_1;
  int J_2;
  int index_1[J_2];
  int index_2[N];
  // data
  vector[N] time;
  vector[N] area;
  // parameter specifications
  real<lower=0> b_sigma;
  real<lower=0> k_sigma;
  real<lower=0> sigma_b_sigma;
  real<lower=0> sigma_k_sigma;
  real<lower=0> sigma_sigma;
  real<lower=0> a0_mu;
  real<lower=0> a0_sigma;
}
parameters {
  real a0_lin_;
  real a0_exp_;
  real<lower=0> b_;
  real<lower=0> k_;
  real<lower=0> sigma_lin_;
  real<lower=0> sigma_exp_;
  real<lower=0> sigma_b_;
  real<lower=0> sigma_k_;
  vector[J_1] b_m_;
  vector[J_1] k_m_;
  vector[J_2] b_mn_;
  vector[J_2] k_mn_;
}
transformed parameters {
  real a0_lin = a0_mu + (a0_lin_ * a0_sigma);
  real a0_exp = a0_mu + (a0_exp_ * a0_sigma);
  real<lower=0> b = b_ * b_sigma;
  real<lower=0> k = k_ * k_sigma;
  real<lower=0> sigma_lin = sigma_lin_ * sigma_sigma;
  real<lower=0> sigma_exp = sigma_exp_ * sigma_sigma;
  real<lower=0> sigma_b = sigma_b_ * sigma_b_sigma;
  real<lower=0> sigma_k = sigma_k_ * sigma_k_sigma;
  vector[J_1] b_m = b + b_m_ * sigma_b;
  vector[J_1] k_m = k + k_m_ * sigma_k;
  vector[J_2] b_mn = b_m[index_1] + b_mn_ * sigma_b;
  vector[J_2] k_mn = k_m[index_1] + k_mn_ * sigma_k;
}
model {
  a0_lin_ ~ normal(0, 1);
  a0_exp_ ~ normal(0, 1);
  b_ ~ normal(0, 1);
  k_ ~ normal(0, 1);
  sigma_lin_ ~ normal(0, 1);
  sigma_exp_ ~ normal(0, 1);
  sigma_b_ ~ normal(0, 1);
  sigma_k_ ~ normal(0, 1);
  b_m_ ~ normal(0, 1);
  k_m_ ~ normal(0, 1);
  b_mn_ ~ normal(0, 1);
  k_mn_ ~ normal(0, 1);
  for (i in 1:N) {
    area[i] ~ normal(lin_growth(time[i], a0_lin, b_mn[index_2[i]]), sigma_lin);
    area[i] ~ normal(exp_growth(time[i], a0_exp, k_mn[index_2[i]]), sigma_exp);
  }
}
generated quantities {
  vector[N] linear;
  vector[N] exponential;
  vector[N] log_lik_lin;
  vector[N] log_lik_exp;
  for (i in 1:N) {
    linear[i] = normal_rng(lin_growth(time[i], a0_lin, b_mn[index_2[i]]), sigma_lin);
    exponential[i] = normal_rng(exp_growth(time[i], a0_exp, k_mn[index_2[i]]), sigma_exp);
    log_lik_lin[i] = normal_lpdf(
      area[i] | lin_growth(time[i], a0_lin, b_mn[index_2[i]]), sigma_lin);
    log_lik_exp[i] = normal_lpdf(
      area[i] | exp_growth(time[i], a0_exp, k_mn[index_2[i]]), sigma_exp);
  }
}
with bebi103.stan.disable_logging():
    samples = sm.sample(data=data, iter_warmup=1000, iter_sampling=1000, chains=4,
                        adapt_delta=0.99, max_treedepth=20, show_progress=True)
samples_lin = az.from_cmdstanpy(posterior=samples, posterior_predictive='linear',
log_likelihood='log_lik_lin')
samples_exp = az.from_cmdstanpy(posterior=samples, posterior_predictive='exponential',
log_likelihood='log_lik_exp')
samples = az.from_cmdstanpy(posterior=samples)
bebi103.stan.check_all_diagnostics(samples, max_treedepth=20,
parameters=['b', 'k', 'a0_lin', 'a0_exp',
'sigma_lin', 'sigma_exp',
'sigma_b', 'sigma_k'])
ESS for parameter b is 393.88140662263083.
ESS for parameter k is 321.9978480887044.
tail-ESS for parameter k is 329.1069618958909.
ESS for parameter sigma_b is 51.378056008536625.
tail-ESS for parameter sigma_b is 31.311932394372235.
ESS for parameter sigma_k is 70.6785630622879.
tail-ESS for parameter sigma_k is 193.9877798808258.
ESS or tail-ESS below 100 per chain indicates that expectation values computed from samples are unlikely to be good approximations of the true expectation values.

Rhat for parameter b is 1.0235288587484037.
Rhat for parameter k is 1.0100248697838576.
Rhat for parameter sigma_b is 1.1013161188134521.
Rhat for parameter sigma_k is 1.0586613586966154.
Rank-normalized Rhat above 1.01 indicates that the chains very likely have not mixed.

0 of 4000 (0.0%) iterations ended with a divergence.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 20.

E-BFMI indicated no pathological behavior.
3
bokeh.io.show(bebi103.viz.corner(samples, parameters=['b', 'sigma_lin', 'a0_lin'],
xtick_label_orientation=np.pi/4))
bokeh.io.show(bebi103.viz.corner(samples, parameters=['k', 'sigma_exp', 'a0_exp'],
xtick_label_orientation=np.pi/4))
Our sampling looks okay. We are close to an ESS of 400 for $b$ and $k$, and the parameters $\sigma_b$ and $\sigma_k$ are not hugely important. The Rhat values are very close to 1.01 and within the looser guideline of 1.1. Drawing more samples would likely fix these issues; we tried running with 2000 iterations, but the Jupyter notebook crashed at the very end and we lost the results, so we only had time to rerun with 1000.
Let's take a look at our posterior predictive checks. We will use the same strategy as for the prior predictive checks, but plot only every 20th of the 4000 samples to keep the plot manageable. We will also overlay the real data.
lin_post_samples = samples_lin.posterior_predictive.linear.values.reshape(4000, 10321)
exp_post_samples = samples_exp.posterior_predictive.exponential.values.reshape(4000, 10321)
lin_post = bokeh.plotting.figure(height=300, width=500, title='Linear',
x_axis_label='time (min)', y_axis_label='area (µm²)')
exp_post = bokeh.plotting.figure(height=300, width=500, title='Exponential',
x_axis_label='time (min)', y_axis_label='')
rg = np.random.default_rng()
for i in range(0, 4000, 20):
    # Pick a random growth event from a random bacterium for this sample
    bacterium = rg.choice([1, 2])
    event = rg.choice(df[df['bacterium'] == bacterium].event.unique())
    choice_index = df[(df['bacterium'] == bacterium) & (df['event'] == event)].index
    lin_post.circle(time[choice_index], lin_post_samples[i][choice_index], alpha=0.1)
    exp_post.circle(time[choice_index], exp_post_samples[i][choice_index], alpha=0.1)
lin_post.circle(time, area, color='orange', alpha=0.05)
exp_post.circle(time, area, color='orange', alpha=0.05)
bokeh.io.show(bokeh.layouts.row([lin_post, exp_post], sizing_mode='scale_both'))
These look pretty good. The eye test says that the linear model might be slightly better since it seems a bit tighter to the data compared to the exponential for larger times.
Let's look at our parameter estimates and see what we can get out of them.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
lin_est = pd.DataFrame(index=['b', 'sigma_lin', 'a0_lin', 'sigma_b'],
                       columns=['95% HDI Lower Bound', 'Median', '95% HDI Upper Bound'])
for param in ['b', 'sigma_lin', 'a0_lin', 'sigma_b']:
    lin_est.loc[param, 'Median'] = np.median(samples_lin.posterior[param])
    lin_est.loc[param, ['95% HDI Lower Bound', '95% HDI Upper Bound']
                ] = az.hdi(samples_lin, 0.95)[param].values
exp_est = pd.DataFrame(index=['k', 'sigma_exp', 'a0_exp', 'sigma_k'],
                       columns=['95% HDI Lower Bound', 'Median', '95% HDI Upper Bound'])
for param in ['k', 'sigma_exp', 'a0_exp', 'sigma_k']:
    exp_est.loc[param, 'Median'] = np.median(samples_exp.posterior[param])
    exp_est.loc[param, ['95% HDI Lower Bound', '95% HDI Upper Bound']
                ] = az.hdi(samples_exp, 0.95)[param].values
print('\nLinear:')
lin_est
print('\n\nExponential:')
exp_est
Linear:
| | 95% HDI Lower Bound | Median | 95% HDI Upper Bound |
|---|---|---|---|
| b | 0.008156 | 0.012741 | 0.016718 |
| sigma_lin | 0.084162 | 0.085302 | 0.086437 |
| a0_lin | 1.37243 | 1.37568 | 1.37883 |
| sigma_b | 0.002669 | 0.003111 | 0.003561 |
Exponential:
| | 95% HDI Lower Bound | Median | 95% HDI Upper Bound |
|---|---|---|---|
| k | 0.004223 | 0.006204 | 0.007965 |
| sigma_exp | 0.091026 | 0.09236 | 0.093577 |
| a0_exp | 1.4399 | 1.44283 | 1.4458 |
| sigma_k | 0.001201 | 0.001382 | 0.001578 |
As we guessed from the plots, $\sigma$ for the linear model is smaller than $\sigma$ for the exponential. A quantity worth examining is the ratio of each rate parameter to the scale parameter describing how that rate varies from trial to trial, i.e., $b\,/\sigma_b$ and $k\,/\sigma_k$.
# Index the 'Median' column directly rather than calling Series.median()
b_sb = lin_est.loc['b', 'Median'] / lin_est.loc['sigma_b', 'Median']
k_sk = exp_est.loc['k', 'Median'] / exp_est.loc['sigma_k', 'Median']
print(f'\nLinear b / \u03C3_b: {b_sb}')
print(f'Exponential k / \u03C3_k: {k_sk}')
Linear b / σ_b: 4.0951239188637585 Exponential k / σ_k: 4.487974534734405
It looks like our estimate of $b$ varies relatively less from trial to trial than $k$ does. We can take this as further indication that the linear model fits the data better: its rate parameter is more tightly determined relative to the trial-to-trial variation, so the model needs less flexibility to fit the data, which is likely indicative of a better model.
Let's try a model comparison using the LOO.
az.compare(dict(Linear=samples_lin, Exponential=samples_exp), ic='loo', scale='deviance')
| rank | loo | p_loo | d_loo | weight | se | dse | warning | loo_scale | |
|---|---|---|---|---|---|---|---|---|---|
| Linear | 0 | -21436.878463 | 85.652748 | 0.00000 | 0.881919 | 215.329750 | 0.000000 | False | deviance |
| Exponential | 1 | -19783.995603 | 93.882012 | 1652.88286 | 0.118081 | 187.550887 | 103.981476 | False | deviance |
The LOO clearly prefers the linear model, as we thought it might; the difference d_loo is many times its standard error dse. Interestingly, the p_loo column contains the effective number of parameters, which is higher for the exponential model than for the linear one. This is likely related to the discussion above of $b/\sigma_b$ and $k/\sigma_k$: the greater uncoupling of trials in the exponential model yields a larger effective number of parameters, which hurts its LOO score and favors the linear model.
From everything that we have done, it seems like the linear model for Caulobacter growth is a better model than the exponential model, in opposition to what the authors of the paper found. The median parameter estimates for this model are given below (intervals can be found above).
# Pull the median estimates for the preferred (linear) model
b = lin_est.loc['b', 'Median']
a0_lin = lin_est.loc['a0_lin', 'Median']
sigma_lin = lin_est.loc['sigma_lin', 'Median']
sigma_b = lin_est.loc['sigma_b', 'Median']
print(f'b estimate: {b} µm²/min')
print(f'a_0 estimate: {a0_lin} µm²')
print(f'sigma estimate: {sigma_lin} µm²')
print(f'sigma_b estimate: {sigma_b} µm²/min')
b estimate: 0.0127412 µm²/min a_0 estimate: 1.37568 µm² sigma estimate: 0.0853019 µm² sigma_b estimate: 0.00311131 µm²/min
Of course, the big question hanging over everything is whether Caulobacter growth may actually be exponential, but on the time scale observed, the linear and exponential models are indistinguishable since an exponential is essentially linear for small $t$. We have a couple of things to say regarding this point.
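Before addressing these points, note that the small-$t$ equivalence can be made precise with a Taylor expansion of the exponential model:

$$a_0e^{kt} = a_0\left(1 + kt + \frac{(kt)^2}{2} + \cdots\right) \approx a_0 + a_0k\,t \quad \textrm{for } kt \ll 1,$$

so on short timescales the exponential model mimics a linear model with slope $b = ka_0$. With our median estimate $k \approx 0.006$ and a 90 minute growth event, $kt$ reaches about $0.56$ by the time of division, so the quadratic correction is no longer negligible.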
First, there is some evidence that growth is indeed linear rather than exponential: the posterior predictive check fans out more for the exponential model than for the linear one. This indicates we are entering the time domain where the non-linearity of the exponential kicks in, and it does not quite match the data. Additionally, we know that on short timescales the exponential is essentially linear with $b=ka_0$. Let's compare this quantity to our actual estimate of $b$.
ka_0 = exp_est.loc['k', 'Median'] * exp_est.loc['a0_exp', 'Median']
print(f'ka\u2080: {ka_0}')
print(f'b: {b}')
ka₀: 0.00895076183045 b: 0.0127412
We see that these values are not in fact equal: $b$ is about 142% of $ka_0$, a substantial discrepancy. This is further evidence that we are reaching the time domain where the non-linearity matters, and it supports preferring the linear model for this experimental data.
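We can quantify how much curvature a single growth event actually contains by comparing the fitted exponential to its own small-$t$ linearization, using the median estimates from the tables above:

```python
import numpy as np

# Median estimates from the exponential fit above
a0, k = 1.44283, 0.006204
t = np.linspace(0, 90, 91)  # one ~90 min growth event

exp_curve = a0 * np.exp(k * t)
tangent = a0 * (1 + k * t)  # small-t linearization with slope k*a0

# Relative deviation of the exponential from its linearization
rel_dev = (exp_curve - tangent) / exp_curve
print(f'max relative deviation: {rel_dev.max():.1%}')
```

The deviation reaches roughly 10% by the end of an event, consistent with the claim that the non-linearity becomes visible on this timescale.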
Second, the question of whether growth is linear or exponential could be explored further in another experiment by somehow increasing the time to division, but this would add confounding factors that would make the results unreliable.
Finally, and somewhat related to the last point, the question of whether growth is linear or exponential on a larger timescale is not really the right question to ask in the first place. The only timescale that actually matters, and the one we are interested in, is that of the Caulobacter life cycle, and we have shown that the linear growth model is preferable in that domain.
This homework was contributed to by Danny, Happy, and David.